The variables ‘PoolQC’, ‘MiscFeature’ and ‘Alley’ are missing more than 93% of their data, while ‘Fence’ is missing about 80%. ‘FireplaceQu’ is missing a little less than 50% of its data, whereas ‘LotFrontage’ is missing about 17%. All other variables are either missing less than 5% of their data or contain no missing values.
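As a rough illustration of how such percentages can be computed, here is a minimal pandas sketch. The small `df` is a made-up stand-in for the housing data (only the column names come from the text), and `missing_pct` is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# isna() marks missing cells; the column-wise mean of that boolean mask
# is the fraction of missing values, scaled here to a percentage.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the sparsest variables at the top, which makes the heavily missing columns easy to spot.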
Based on the data description, ‘PoolQC’ is NaN when there is no pool, so I replaced its missing values with ‘No_Pool’. Similar approaches were applied to ‘Fence’, ‘FireplaceQu’, and the garage- and basement-related features. Next, the variable mean was used to fill missing data for the ‘LotFrontage’ and ‘MasVnrArea’ variables.
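The two filling strategies could look roughly like the sketch below. The toy `df` is invented, and of the replacement labels only ‘No_Pool’ appears in the text; ‘No_Fence’ and the name `absent_labels` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
df = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "Fence": ["MnPrv", np.nan, np.nan],
    "LotFrontage": [60.0, np.nan, 80.0],
})

# Categorical columns where NaN means "feature absent".
# 'No_Pool' comes from the text; 'No_Fence' is a hypothetical label.
absent_labels = {"PoolQC": "No_Pool", "Fence": "No_Fence"}
df = df.fillna(value=absent_labels)

# Numerical column: replace missing entries with the column mean.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())
print(df)
```

Passing a dictionary to `fillna` applies a different fill value to each listed column and leaves the others untouched.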
Four observations have a GrLivArea above 4,000 square feet and were treated as outliers.
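A threshold filter like this can be sketched with a boolean mask. The rows below are made up, and `outliers` and `df_clean` are hypothetical names; only the column name and the 4,000 cutoff come from the text.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
df = pd.DataFrame({
    "GrLivArea": [1710, 4316, 2090, 5642],
    "SalePrice": [208500, 755000, 200000, 160000],
})

# Rows above the cutoff are flagged as outliers...
outliers = df[df["GrLivArea"] > 4000]

# ...and the remaining rows are kept for modelling.
df_clean = df[df["GrLivArea"] <= 4000].reset_index(drop=True)
print(len(outliers), len(df_clean))
```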
After filling the missing values and deleting the outliers, I split the dataset into training and testing sets using a 0.75 partition.
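One way to produce such a split is sketched below with plain pandas. It assumes that 0.75 is the share of rows assigned to the training set; the random seed, the toy data, and the names `train_set` and `test_set` are likewise assumptions.

```python
import pandas as pd

# Toy stand-in for the cleaned housing data (100 invented rows).
df = pd.DataFrame({
    "GrLivArea": range(1000, 1100),
    "SalePrice": range(100000, 100100),
})

# Draw 75% of the rows at random for training (fixed seed for repeatability).
train_set = df.sample(frac=0.75, random_state=42)

# Everything not drawn for training becomes the testing set.
test_set = df.drop(train_set.index)
print(len(train_set), len(test_set))
```

Dropping by index guarantees that the two sets do not overlap and together cover every row.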
For each categorical variable, I checked the distribution of SalePrice across the variable’s values, using the training set only.
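A simple numerical version of this check is a grouped summary, sketched here for a single variable. The toy training rows are invented, and the choice of count, median, and mean as summary statistics is an assumption; a box plot per category would serve the same purpose.

```python
import pandas as pd

# Toy stand-in for the training set; the values are invented for illustration.
train = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex", "TA", "Gd", "Ex"],
    "SalePrice": [200000, 140000, 320000, 150000, 210000, 300000],
})

# Summarise SalePrice within each category, ordered by median price.
summary = (
    train.groupby("KitchenQual")["SalePrice"]
    .agg(["count", "median", "mean"])
    .sort_values("median", ascending=False)
)
print(summary)
```

Categories whose price summaries differ strongly from one another suggest that the variable carries information about SalePrice.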
For each numerical variable, I checked the correlation between SalePrice and the variable, using the training set only.
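The correlation check might be sketched as follows. The text does not say which correlation coefficient was used, so this assumes Pearson correlation (the pandas default); the toy rows and the name `corr` are also assumptions.

```python
import pandas as pd

# Toy stand-in for the training set; the values are invented for illustration.
train = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "LotArea": [9000, 7000, 9500, 8000, 8500],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown", "CollgCr"],
    "SalePrice": [100000, 150000, 210000, 240000, 310000],
})

# Keep numeric columns only, then read off each column's correlation
# with SalePrice and rank the variables from strongest to weakest.
corr = (
    train.select_dtypes(include="number")
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr)
```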
I then combined the numerical and categorical variables to evaluate feature importance, again using the training set only.
The Recursive Feature Elimination (RFE) method also identified 16 candidate variables: “GrLivArea”, “Neighborhood”, “OverallQual”, “BsmtFinSF1”, “TotalBsmtSF”, “1stFlrSF”, “GarageArea”, “2ndFlrSF”, “GarageCars”, “LotArea”, “ExterQual”, “GarageType”, “FireplaceQu”, “BsmtFinType1”, “KitchenQual”, and “OverallCond”.
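For readers unfamiliar with RFE, the following is a minimal scikit-learn sketch of the technique, not the exact setup used here. The estimator (`LinearRegression`), the standardisation step, the synthetic data, and the choice to keep two features are all assumptions; the actual analysis kept 16 variables and included categorical ones, which would need to be encoded numerically first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative columns and two pure-noise columns.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "Noise1": rng.normal(size=n),
    "Noise2": rng.normal(size=n),
})
y = 100 * X["GrLivArea"] + 20000 * X["OverallQual"] + rng.normal(0, 1000, n)

# Standardise so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

# RFE repeatedly fits the model and drops the weakest feature
# until only the requested number of features remains.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_scaled, y)

selected = X.columns[rfe.support_].tolist()
print(selected)
```

With a linear estimator, RFE ranks features by coefficient size, which is why the features are put on a common scale before fitting.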